As required, this task was an open one, so the students had to choose a specific topic on their own. Our group chose a dataset found at https://labrosa.ee.columbia.edu/millionsong/pages/getting-dataset#subset. This subset contains 10,000 song files and is about 2 GB in size. The full dataset is about 300 GB and has around 1 million entries, in this case songs. Besides the analysis, the dataset includes metadata such as artist and year of production, and finally music data features for each song in HDF5 format. The original provider of this dataset is The Echo Nest (http://the.echonest.com), which used to be a music intelligence and data platform for developers until it was acquired by Spotify (https://www.spotify.com/de/), a well-known music streaming provider.1 According to the dataset description, it is the result of a collaboration between The Echo Nest and LabROSA (https://labrosa.ee.columbia.edu). 2
Our goal in this project is to analyze some songs that we like. Since all song files are labeled with artist and song names as well as the year of production, almost every song can be found either on YouTube (https://www.youtube.com) or on Spotify (https://www.spotify.com/de/). First, we listen to some of the songs to find the ones we prefer. Next, we analyze those songs to understand the data that describes our preferences. Finally, we use Spotify for prediction. Restricting the analysis to songs we know should lead to a better analysis and understanding of the given data; otherwise we would be comparing mostly unrelated data, which is not suitable for research purposes.
Alongside the above analysis, we also want some more general information about the artists and their songs. Therefore we visualize some general information as well.
After downloading and unzipping the data, one can see two folders. The first one, ‘data’, contains several other folders, and the second one, ‘AdditionalFiles’, contains additional files in either SQL or txt format. The directory structure is based on The Echo Nest track IDs 3. The ‘data’ folder contains exclusively song files in HDF5 (Hierarchical Data Format 5) format. This format is mostly used in scientific applications for big datasets. It was originally developed at the National Center for Supercomputing Applications (NCSA) and is used by NASA 4 to handle large, heterogeneous and hierarchical datasets. Each of those files holds some analysis, some metadata and some further information that is stored on MusicBrainz (https://musicbrainz.org), an open music encyclopedia. The data available in ‘AdditionalFiles’ is used for a first hands-on look at the whole dataset, since it is simple to access. By doing so we obtain some general information about the dataset. To read both data folders one should install some additional packages, which are mentioned later on.
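The mapping from a track ID to the location of its file can be sketched as follows. This is only a minimal sketch: the helper name `track_path` and the root folder name are assumptions made for illustration, while the rule itself (the third to fifth characters of the ID name the nested folders) follows the directory layout described in footnote 3.

```r
# Hypothetical helper (not part of the dataset tools):
# build the expected path of a track's HDF5 file from its track ID.
track_path <- function(root, track_id) {
  file.path(root, "data",
            substr(track_id, 3, 3),   # first folder level
            substr(track_id, 4, 4),   # second folder level
            substr(track_id, 5, 5),   # third folder level
            paste0(track_id, ".h5"))
}

# Example with the track ID used in footnote 3; the root name is assumed.
example_path <- track_path("MillionSong", "TRADHRX12903CD3866")
print(example_path)
```

Such a helper avoids a recursive file search when the track ID is already known.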
For more information about the dataset, especially the frequently asked questions, we recommend visiting (https://labrosa.ee.columbia.edu/millionsong/faq).
When accessing the data provided in the ‘AdditionalFiles’ folder, one has to replace the separator <SEP> with a common single-character separator like ‘;’. This is necessary because R's read functions expect a one-byte separator, so a file with a separator like <SEP> cannot be read directly.
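The replacement described above can also be done in R itself. The following is a minimal sketch, not the preprocessing actually used for this report: the helper name `replace_sep` and the two demo rows are made up for illustration, and it assumes that ‘;’ does not occur inside any field.

```r
# Hypothetical helper: rewrite a <SEP>-delimited text file with a one-byte separator.
replace_sep <- function(infile, outfile, sep = ";") {
  lines <- readLines(infile, encoding = "UTF-8")
  # fixed = TRUE treats "<SEP>" as a literal string, not as a regular expression
  writeLines(gsub("<SEP>", sep, lines, fixed = TRUE), outfile)
  invisible(outfile)
}

# Demo with a temporary file; the rows are invented and not taken from the dataset.
tmp_in  <- tempfile(fileext = ".txt")
tmp_out <- tempfile(fileext = ".txt")
writeLines(c("AR0001<SEP>mbid-1<SEP>TR0001<SEP>Artist One",
             "AR0002<SEP>mbid-2<SEP>TR0002<SEP>K's Example"), tmp_in)
replace_sep(tmp_in, tmp_out)

# quote = "" keeps apostrophes in names from being read as quote characters
demo <- read.csv2(tmp_out, sep = ";", header = FALSE, quote = "",
                  stringsAsFactors = FALSE,
                  col.names = c("artistId", "V2", "trackID", "artistName"))
print(demo)
```

If a name does contain ‘;’, a different single character that is absent from the data, for example a tab, would be the safer choice.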
The following code chunk was used only to read the txt files in RStudio.
# Load preprocessed data and name the columns
location <- read.csv2('data/subset_artist_location.txt',sep = ';', header = FALSE, col.names = c('artistId', 'lat','lon', 'trackID', 'artistName'))
artists <- read.csv2('data/subset_unique_artists.txt',sep = ';', header = FALSE, col.names = c('artistId', 'V2', 'trackID', 'artistName'))
tags <- read.csv2('data/subset_unique_mbtags.txt',sep = ';', header = FALSE, col.names = c('tags'))
uni_terms <- read.csv2('data/subset_unique_terms.txt',sep = ';', header = FALSE, col.names = c('terms Unique' ))
tracks <- read.csv2('data/subset_unique_tracks.txt',sep = ';', header = FALSE, col.names = c('trackID','V2', 'artistName','songName'))
tracksPerYear <- read.csv2('data/subset_tracks_per_year.txt',sep = ';', header = FALSE, col.names = c('Year', 'trackID', 'artistName','songName'))
The following code loads the packages that are required to make a word cloud. When creating a word cloud, one will notice that the first version has a very poor distribution, mostly because it is dominated by the most common words in the English language. Those words carry no meaning for our purposes. Therefore, based on this observation and a Wikipedia article 5, these words should be removed from the dataset. The recommendation is to remove ‘the’, ‘and’ and ‘a’.
To follow the next code chunk, it is important to understand what each package is used for. We start with ‘tm’ (Text Mining Package), which is commonly used for word clouds and for handling strings. First, Corpus creates a collection of text documents, a corpus 6. It takes its input from VectorSource, which treats each element of a vector as one document. Then tm_map, an interface for applying transformation functions to corpora, is used together with content_transformer, which wraps a function so that it gets and sets the content of a document. These steps were used to preprocess the documents. Afterwards, a term-document matrix is created, which contains every term in the documents and the documents it appears in.
The package ‘wordcloud’ is very useful and provides a graphical representation of the frequencies of the words used in one or more documents 7. The resulting word clouds can be seen in the following plots.
# Load packages
# library("NLP")
library("tm") # for text mining
# library("SnowballC") # for text stemming
library("RColorBrewer") # color palettes
library("wordcloud") # word-cloud generator
docs <- Corpus(VectorSource(as.String(artists$artistName)))
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# most common words in English that carry no meaning for this purpose
others <- c('the','and','a')
# replace the given patterns with a space
# note: gsub matches substrings, so 'a' is also removed inside words (e.g. 'orchestra' becomes 'orchestr')
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
for (i in seq_along(others)){
  docs <- tm_map(docs, toSpace, others[i])
}
# calculate the frequency of the occurring words
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
wordcloud(words = d$word, freq = d$freq, min.freq = 1, scale = c(3,0.2),
max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))
mtext('Artistnames', side = 2, line = 1, adj = 0.5) # title
Looking at the word cloud above, one can see that the most common words in artist names are ‘orchestr’ and ‘john’. Truncated forms such as ‘orchestr’ (orchestra) and ‘turing’ (featuring) appear because the cleaning step also removes the letter ‘a’ inside words. There are also some Spanish artist names containing words like ‘los’. This improves our knowledge of the dataset: it becomes clear that the artists come not only from England, Europe or the US but also from Spain or Latin America. One could of course remove the articles of all languages contained in the dataset, so that the article ‘los’ would be removed as well. Some other names like Joe or King are also quite common. To support further assumptions and to get a better understanding of the word cloud, the actual frequencies of the most frequent entries are provided in a table.
# show only head of frequency dataFrame
head(d,8)
## word freq
## john john 41
## orchestr orchestr 38
## los los 31
## vid vid 31
## turing turing 25
## joe joe 21
## bro bro 19
## king king 19
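Entries such as ‘orchestr’ and ‘turing’ in the table come from removing ‘a’ as a plain substring. A minimal sketch of whole-word removal in base R is shown below; the helper name `remove_whole_words` and the two example names are assumptions for illustration, and the frequencies in this report were not recomputed with it. The ‘tm’ package also offers `removeWords` for the same purpose.

```r
# Hypothetical helper: remove the given words only where they stand alone.
remove_whole_words <- function(x, words) {
  # \\b marks a word boundary, so 'a' inside 'orchestra' is left untouched
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, " ", x, perl = TRUE)
}

# Invented example names, already lower-cased
names_demo <- c("the london symphony orchestra", "david and the giants")
stop_words <- c("the", "and", "a")

# Substring removal, as in the chunk above
substring_version <- names_demo
for (w in stop_words) substring_version <- gsub(w, " ", substring_version)

# Whole-word removal
cleaned <- remove_whole_words(names_demo, stop_words)

print(substring_version)
print(cleaned)
```

With whole-word matching, ‘orchestra’ and ‘david’ stay intact while the standalone stop words are still removed.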
Together, this table and the word cloud give a better understanding of the distribution of artist names in the given dataset. It is now interesting to find out more about the most common name, John. A short search on the internet [research] suggests that John was one of the most common names in the 1990s. To check whether this name occurs mostly in the 1990s in the dataset, one should take a closer look at those years.
tracksPerYear$artistName[tracksPerYear$Year >= 1990 & tracksPerYear$Year <= 2000]
## [1] K's Choice K's Choice Kaija Koo
## [4] Kisha Lee Ritenour Les Malpolis
## [7] Lisa Lynne Los Amigos Invisibles Los Amigos Invisibles
## [10] Luciana Souza M.A. Numminen Mandi
## [13] Martin Sexton Martin Sexton Mithotyn
## [16] Mithotyn Monster Magnet Moonspell
## [19] Mudhoney Natural Elements Nic Endo
## [22] Old Man's Child OutKast
## 1149 Levels: !!! 2 Minutos 2-4 Grooves feat. Reki D. ... Zombina & The Skeletones
After displaying the artist names with entries between the years 1990 and 2000, the assumption made before has to be rejected: the name John does not appear in this subset. However, another common word, ‘Los’, does appear in it. This set needs to be described and explored further, because the previous exploration does not provide much information.
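Instead of reading the printed names by eye, the occurrences of a name within a range of years can be counted directly. The sketch below uses invented rows that are not taken from the dataset, and the helper name `count_name` is an assumption; it expects a data frame with the columns Year and artistName, as in tracksPerYear.

```r
# Illustrative data only: these rows are made up and are NOT from the dataset.
toy <- data.frame(
  Year = c(1985, 1992, 1995, 1999, 2004),
  artistName = c("John Doe Band", "Los Ejemplos", "Johnny Example",
                 "Sample John Trio", "John Placeholder"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count artist names containing `name` as a whole word
# among the entries with from <= Year <= to.
count_name <- function(df, name, from, to) {
  in_range <- df$Year >= from & df$Year <= to
  pattern <- paste0("\\b", name, "\\b")
  sum(grepl(pattern, df$artistName[in_range], ignore.case = TRUE, perl = TRUE))
}

in_nineties <- count_name(toy, "john", 1990, 2000)
overall     <- count_name(toy, "john", 1980, 2010)
print(c(in_nineties = in_nineties, overall = overall))
```

Comparing the count for the 1990s with the overall count shows directly whether the name is concentrated in that decade.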
Almost the same analysis was done on common song names. However, the common words in this case were not quite the same as before. To find common song names, the word cloud was first plotted as an uncleaned version containing all words. After deciding which words carry no meaning for the final statement, those words were removed. Cleaning with the words ‘the’, ‘version’, ‘and’, ‘from’, ‘feat’ and ‘album’ produced the following word cloud.
# Load packages
#library("NLP")
#library("tm") # for text mining
#library("SnowballC") # for text stemming
#library("RColorBrewer") # color palettes
#library("wordcloud") # word-cloud generator
docs1 <- Corpus(VectorSource(as.character(tracks$songName)))
# Convert the text to lower case
docs1 <- tm_map(docs1, content_transformer(tolower))
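# words that carry no meaning for this purpose
others <- c('the','version','and','from', 'feat','album')
# replace the given patterns with a space (gsub matches substrings, also inside words)
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
for (i in seq_along(others)){
  docs1 <- tm_map(docs1, toSpace, others[i])
}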
dtm <- TermDocumentMatrix(docs1)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))
mtext('Songnames', side = 3, line = 0, adj = 0.5) # title
Looking at the result, one can see the frequent words ‘you’ and ‘love’. This suggests that the song names in this dataset are likely to deal with love and with the person addressed, ‘you’. A general assumption could be that there are more songs about love, about another person and about life (‘live’, which may also simply mark live recordings) than about, for example, technology or travelling. However, this assumption cannot be fully proven, since this dataset does not represent all the song names in the world.
The following table gives more detailed information about the distribution of words in the song names.
head(d,7)
## word freq
## you you 540
## love love 332
## live live 216
## for for 185
## all all 144
## your your 143
## don don 137
Since it is clear that the dataset does not only contain artists from England, Europe or the US, it would be nice to have a world map together with the locations of the artists. This can be achieved with the package ‘maps’ 8. This package can draw a map selected by a name such as world, and it can also restrict the map to a range of longitude and latitude to take a closer look at a particular region. It makes it easy to draw complex maps as well as to place points on them. The world map below shows all artists with their locations. Unfortunately, the dataset does not provide a location for every artist, but it nevertheless gives a good overview.
library(maps)
#library(mapdata)
#library(eurostat)
# parse the lat and lon values of given set
lon <- as.double(as.character(location$lon))
lat <- as.double(as.character(location$lat))
# keep only rows where both coordinates are present, so lon and lat stay aligned
coordinates <- data.frame(lon = lon, lat = lat)
coordinates <- coordinates[complete.cases(coordinates), ]
# take a closer look at europe
#europe <- as.data.frame(cbind(lon = c(54.78333, 24.08464, -31.26192, 59.34569), lat = c(80.56667, 34.83469, 39.45479, 62.21215)))
map('world',c('.'), col = "grey80", fill = TRUE, border = "grey40")
points(coordinates$lon, coordinates$lat, col = "red", cex = .1)
#x <- map('world', xlim = range(europe$lon), ylim = range(europe$lat), namefield = TRUE)
#x$names <- gsub("\\:.*","",x$names)
map(col = "grey80", border = "grey40", fill = TRUE,
xlim = c(-25, 45), ylim = c(36, 70), mar = rep(0.1, 4))
points(coordinates$lon, coordinates$lat, col = "red", cex = .3)
#source("http://bioconductor.org/biocLite.R")
#biocLite("rhdf5")
library(rhdf5) # required for H5 files
# set a hardcoded Path to the MillionSongSubset
pathToSet = '/Users/Kostja/Desktop/Master/Sem 2 (18 SoSe)/Data Visualization/Tasks/MillionSongSubset'
# track IDs of the preferred songs, looked up beforehand
TrackIDs <- array(c('TRAPZTV128F92CAA4E','TRANNZZ128F92C22F7','TRAQZQX128F931338F','TRALONM128EF35A199','TRAWBHE12903CBC4CB'))
# automatically find the paths of all files whose names contain the track IDs
SubPaths <- lapply(TrackIDs,function(x){
list.files(pathToSet, x, recursive=TRUE, full.names=TRUE, include.dirs=TRUE)
})
# beautify the dataset
SubPaths <- data.frame(SubPaths = t(unlist(SubPaths)))
names(SubPaths) <- c('beyonce', 'justin', 'kanye', 'madonna', 'bruno')
# read the H5 files and create a readable output
artist <- lapply(SubPaths, function(x){
h5ls(toString(x))
})
Analyze_song <- apply(SubPaths,2,function(x){
h5read(x,"/analysis/songs")
})
Analyze_song <- do.call(rbind, Analyze_song)
Meta_song <- apply(SubPaths,2,function(x){
h5read(x,"/metadata/songs")
})
Meta_song <- do.call(rbind, Meta_song)
library(fmsb)
radarFrame <- function(df1, df2){
  # df1: metadata per song, df2: analysis values per song
  m <- cbind('artist_familiarity' = df1$artist_familiarity, 'artist_hotttnesss' = df1$artist_hotttnesss, 'tempo'= df2$tempo, 'time_signature' = df2$time_signature, 'loudness' = df2$loudness, 'key' = df2$key)
  rownames(m) <- rownames(df1)
  # return the combined values as a data frame
  data.frame(m)
}
namesLegend <- paste(Meta_song$artist_name,Meta_song$title)
radar <- function(df, namesLeg = namesLegend, x = -2.8 , y= -1.1){
  # one semi-transparent fill colour per row (song)
  transparency <- adjustcolor(1:dim(df)[1], alpha.f = 0.2)
  # customize the radar chart; with maxmin = FALSE the axis ranges are derived from the data
  radarchart( df , axistype=1 , maxmin = FALSE,
    # customize the polygons
    pcol=1:dim(df)[1], plwd=1 , pfcol = transparency ,
    # customize the grid
    cglcol="grey", cglty=1, axislabcol=FALSE ,
    # customize the labels
    vlcex=0.8
  )
  # allow the legend to be drawn outside the plot region
  par(xpd=TRUE)
  legend(x,y, legend = namesLeg, bty = "n", pch=20 , col=1:dim(df)[1] , cex=0.8, pt.cex=2)
}
data <- radarFrame(Meta_song, Analyze_song)
radar(data)
# variables to look at for the radar chart
# artist familiarity: under metadata
# the hotttnesss values are estimates, i.e. computed by The Echo Nest, and hard to interpret in absolute terms
# tempo (in songs): compare with another site because it is not quite correct
# time signature (in songs): also compare with another site. Both values come from the same dataset and therefore share the same error. If a different dataset is added, the error may no longer be reproducible and the bias may be completely distorted, so that we can no longer draw any conclusions.
# loudness: in songs
# key: in songs
# for everything above, take data from another site and create a radar plot for comparison
# loudness max as a more detailed value
compareFrame <- data.frame(rbind(
beyonce = c('familiarity' = 70, 'tempo' = 97, 'time_signature' = 4, 'loudness' = -5,'key' = 1),
justin = c('familiarity' = 70, 'tempo' = 76, 'time_signature' = 4, 'loudness' = -5,'key' = 7),
kanye = c('familiarity' = 65, 'tempo' = 106, 'time_signature' = 4, 'loudness' = -5,'key' = 9),
madonna = c('familiarity' = 54, 'tempo' = 119, 'time_signature' = 4, 'loudness' = -7,'key' = 9),
bruno = c('familiarity' = 70, 'tempo' = 104, 'time_signature' = 4, 'loudness' = -6,'key' = 10)
))
# because all time signatures are 4, this axis cannot be drawn properly
# radarchart scales each axis relative to the data
radar(compareFrame)
# not really comparable, as seen
par(mfrow = c(1,2))
radar(data,x=-2.2, y = -1.2)
radar(compareFrame,x=-2.2)
par(mfrow = c(1,1))
Track IDs of the selected songs: beyonce TRAPZTV128F92CAA4E, justin TRANNZZ128F92C22F7, kanye TRAQZQX128F931338F, madonna TRALONM128EF35A199, bruno (Bruno Mars) TRAWBHE12903CBC4CB.
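Why the time signature axis of the comparison chart cannot be drawn properly can be illustrated with a small sketch. It assumes that each axis is rescaled between the minimum and maximum of its column; this approximates relative scaling in general and is not the actual implementation in ‘fmsb’. The helper name `rescale01` is hypothetical, and the values are the tempo and time signature entries of compareFrame.

```r
# Hypothetical helper: rescale a column to the range [0, 1] relative to its own min and max.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# tempo and time_signature values as entered in compareFrame
compare <- data.frame(
  tempo          = c(97, 76, 106, 119, 104),
  time_signature = c(4, 4, 4, 4, 4),
  row.names      = c("beyonce", "justin", "kanye", "madonna", "bruno")
)

scaled <- as.data.frame(lapply(compare, rescale01), row.names = rownames(compare))
print(scaled)
```

For a constant column the range is zero, so the rescaled value is 0/0 and no position on the axis is defined. Fixed axis limits chosen by hand would avoid this.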
# library(fmsb)
# Tune_Beyance
# Tune_Justin <- c(,,76,,-5,8)
# Tune_Kanye
# Tune_Bruno
# Tune_Madonna
loudness_start <- apply(SubPaths,2,function(x){
h5read(x,"/analysis/segments_loudness_start")
})
loudness_max <- apply(SubPaths,2,function(x){
h5read(x,"/analysis/segments_loudness_max")
})
par(mfrow= c(1,2))
boxplot(loudness_start, main = 'loudness_start' )
boxplot(loudness_max, main = 'loudness_max' )
mtext('Boxplots of loudness', outer = TRUE, side = 3, line = -1)
par(mfrow= c(1,1))
Draw_matrix_plots <- function(plt){
  # three panels in the top row, two centred panels in the bottom row
  layout(matrix(c(1,1,2,2,3,3,0,4,4,5,5,0), 2, byrow = TRUE), heights=c(2,2))
  # one line plot per artist, titled with the artist's name
  for (i in seq_along(plt)){
    plot(plt[[i]], type = 'l', axes = FALSE, xlab = '', ylab = '', main = names(plt)[i])
    axis(2)
    axis(1)
  }
  mtext(paste('Plot', deparse(substitute(plt)),'for different artists' ), side = 3, line = -19, outer = TRUE)
  # reset the layout
  par(mfrow=c(1,1))
}
Draw_matrix_plots(loudness_start)
Draw_matrix_plots(loudness_max)
matplot_Draw <- function(plt){
dFrame <- do.call(cbind, plt)
matplot(dFrame,type = "l", col = 1:dim(dFrame)[2], ylab = "loudness", xlab = 'segmentstep', main = paste('matplot', deparse(substitute(plt))))
legend("topleft", legend = names(plt), col = 1:dim(dFrame)[2], pch = 16)
}
matplot_Draw(loudness_start)
## Warning in (function (..., deparse.level = 1) : number of rows of result is
## not a multiple of vector length (arg 2)
matplot_Draw(loudness_max)
## Warning in (function (..., deparse.level = 1) : number of rows of result is
## not a multiple of vector length (arg 2)
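The warnings above occur because the songs have different numbers of segments, so cbind recycles the shorter vectors to the length of the longest one, and the recycled values are then drawn as if they were real data. A minimal sketch of padding the shorter vectors with NA is shown below; the helper name `pad_to_matrix` and the toy list are assumptions for illustration.

```r
# Hypothetical helper: combine vectors of unequal length into a matrix,
# filling the missing tail of shorter vectors with NA instead of recycling values.
pad_to_matrix <- function(lst) {
  n <- max(lengths(lst))
  sapply(lst, function(v) c(v, rep(NA, n - length(v))))
}

# Toy input with unequal lengths (not taken from the dataset)
toy_list <- list(a = 1:3, b = 1:5)
padded <- pad_to_matrix(toy_list)
print(padded)
```

matplot skips NA values, so a matrix padded this way could replace do.call(cbind, plt) in matplot_Draw without drawing recycled segments.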
# not sure about this one
Analyze_pitch <- apply(SubPaths,2,function(x){
h5read(x,"/analysis/segments_pitches")
})
boxplot(Analyze_pitch)
Analyze_timbre <- apply(SubPaths,2,function(x){
h5read(x,"/analysis/segments_timbre")
})
boxplot(Analyze_timbre)
The H5 data explained: https://labrosa.ee.columbia.edu/millionsong/pages/example-track-description
European limits: http://www.milanor.net/blog/maps-in-r-introduction-drawing-the-map-of-europe/
Comparison sites: http://www.findsongtempo.com and http://www.tunebat.com
1. https://en.wikipedia.org/wiki/The_Echo_Nest (page viewed 02.07.18)
2. https://labrosa.ee.columbia.edu/millionsong/ (page viewed 02.07.18)
3. Track IDs have the form TR + letters + letters and numbers; the directory path within the dataset is built from the three letters that follow the ‘TR’ prefix, e.g. ‘MillionSong/data/A/D/H/TRADHRX12903CD3866.h5’
4. National Aeronautics and Space Administration, https://www.nasa.gov/about/index.html (page viewed 02.07.18)
5. https://en.wikipedia.org/wiki/Most_common_words_in_English (page viewed 26.06.18)
6. https://cran.r-project.org/web/packages/tm/tm.pdf (page viewed 26.06.18)
7. https://cran.r-project.org/web/packages/wordcloud/wordcloud.pdf (page viewed 27.06.18)
8. https://cran.r-project.org/web/packages/maps/maps.pdf (page viewed 25.06.18)